Citation

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Univariate Plots Section

## [1] "Frequency distribution of quality"
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

## [1] "Summary of variable fixed.acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

## [1] "Summary of variable volatile.acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

## [1] "Summary of variable citric.acid"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

## [1] "Summary of variable residual.sugar"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

## [1] "Summary of variable chlorides"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

## [1] "Summary of variable free.sulfur.dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

## [1] "Summary of variable total.sulfur.dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

## [1] "Summary of variable density"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

## [1] "Summary of variable pH"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

## [1] "Summary of variable sulphates"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

## [1] "Summary of variable alcohol"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

  1. The distribution of the dependent variable quality is roughly symmetric around 5 and 6, with a slight right skew.
  2. The independent variables fixed.acidity, volatile.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and alcohol are right-skewed to varying degrees; their shapes loosely resemble a Poisson distribution, although the variables are continuous.
  3. The independent variables residual.sugar, chlorides and sulphates have long tails on the positive side.
  4. The independent variables density and pH are roughly normally distributed.
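
The skew of quality can also be checked numerically from the frequency table printed above. The following is a minimal sketch: it rebuilds the quality vector from the printed counts and applies the moment-based skewness formula. The helper `skewness` is defined here only for illustration and is not part of the original analysis.

```r
# Rebuild the quality vector from the frequency table printed above.
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# Moment-based skewness: third central moment divided by sd^3
# (population form, dividing by n). Hypothetical helper for illustration.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

quality_mean <- mean(quality)
quality_skew <- skewness(quality)
print(c(mean = quality_mean, median = median(quality), skewness = quality_skew))
```

A positive value indicates a longer right tail and a negative value a longer left tail.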

Univariate Analysis

What is the structure of your dataset?

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The data set contains information about 1599 red variants of the Portuguese “Vinho Verde” wine. There are twelve variables about each wine.

What is/are the main feature(s) of interest in your dataset?

The variable quality is the dependent variable, and the remaining eleven variables are independent variables. The dependent variable is the one we hope to understand better in this dataset.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

## [1] "correlation between each independent variable and quality"
##        fixed.acidity     volatile.acidity          citric.acid 
##                0.124                0.391                0.226 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                0.014                0.129                0.051 
## total.sulfur.dioxide              density                   pH 
##                0.185                0.175                0.058 
##            sulphates              alcohol 
##                0.251                0.476

The correlations, reported here in absolute value, between any single independent variable and the dependent variable are not strong: the largest is 0.476, for alcohol. We will probably need several variables working together to predict wine quality.
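
As a rough illustration of how such a vector of correlations can be computed, the sketch below uses only the ten rows of three predictors shown in the str() preview above. Because it is a toy subset, its numbers will not match the full-data values, and the name `wine_head` is introduced here purely for the example.

```r
# First ten rows of three predictors and quality, copied from the str() preview.
# Toy subset: it illustrates the calculation, not the full-data result.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

predictors <- setdiff(names(wine_head), "quality")

# Absolute Pearson correlation of each predictor with quality.
abs_cor <- sapply(wine_head[predictors],
                  function(x) abs(cor(x, wine_head$quality)))
print(round(abs_cor, 3))
```

Taking the absolute value hides the direction of each relationship, so the sign should be inspected separately when direction matters.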

Did you create any new variables from existing variables in the dataset?

I created a variable called leqFive indicating whether a wine has a quality less than or equal to five. The reason is that 47% of the wines have a quality of 5 or lower and 53% have a quality of 6 or higher. In addition, wines of quality 5 or 6 make up 82% of all wines. So it is important to be able to distinguish wines with quality less than or equal to 5 from wines with quality greater than or equal to 6.
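
A minimal sketch of this derived variable, rebuilt from the frequency table of quality shown earlier. Storing leqFive as a logical vector is an assumption; the original code may have used a factor or a 0/1 indicator instead.

```r
# Rebuild the quality vector from the printed frequency table.
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# TRUE when the wine has quality 5 or lower (assumed logical encoding).
leqFive <- quality <= 5

pct_leq_five    <- 100 * mean(leqFive)
pct_five_or_six <- 100 * mean(quality %in% c(5, 6))
print(round(c(leqFive = pct_leq_five, fiveOrSix = pct_five_or_six)))
```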

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I think the dependent variable is better treated as a categorical variable, so I am converting it to a factor in R.

After this transformation, predicting wine quality becomes a classification problem.
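
The conversion itself is a one-liner. The sketch below applies it to a quality vector rebuilt from the frequency table above; the name `quality_f` is chosen here for illustration.

```r
# Rebuild the integer quality scores from the printed frequency table.
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# Treat quality as a categorical variable with one level per observed score.
quality_f <- factor(quality)

print(levels(quality_f))
print(table(quality_f))
```

With a factor response, randomForest() fits a classification forest rather than a regression forest.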

Bivariate Plots Section

## [1] "Summary of variable fixed.acidity By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.230  12.600 
## [1] "One-way ANOVA test"
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## quality        5     94  18.737   6.283 8.79e-06 ***
## Residuals   1593   4751   2.982                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Summary of variable volatile.acidity By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500 
## [1] "One-way ANOVA test"
##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        5   8.22   1.645   60.91 <2e-16 ***
## Residuals   1593  43.01   0.027                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Summary of variable citric.acid By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200 
## [1] "One-way ANOVA test"
##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        5   3.53  0.7059   19.69 <2e-16 ***
## Residuals   1593  57.11  0.0359                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Summary of variable residual.sugar By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400 
## [1] "One-way ANOVA test"
##               Df Sum Sq Mean Sq F value Pr(>F)
## quality        5     10   2.094   1.053  0.385
## Residuals   1593   3166   1.988

## [1] "Summary of variable chlorides By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600 
## [1] "One-way ANOVA test"
##               Df Sum Sq  Mean Sq F value   Pr(>F)    
## quality        5  0.066 0.013162   6.036 1.53e-05 ***
## Residuals   1593  3.474 0.002181                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Summary of variable free.sulfur.dioxide By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0    11.0    14.5    34.0 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   12.26   15.00   41.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   15.00   16.98   23.00   68.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   14.00   15.71   21.00   72.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   14.05   18.00   54.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00    7.50   13.28   16.50   42.00 
## [1] "One-way ANOVA test"
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## quality        5   2571   514.1   4.754 0.000257 ***
## Residuals   1593 172274   108.1                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Summary of variable total.sulfur.dioxide By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00 
## [1] "One-way ANOVA test"
##               Df  Sum Sq Mean Sq F value Pr(>F)    
## quality        5  128045   25609   25.48 <2e-16 ***
## Residuals   1593 1601155    1005                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Summary of variable density By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9962  0.9976  0.9975  0.9988  1.0010 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9956  0.9965  0.9965  0.9974  1.0010 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0030 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0040 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9958  0.9961  0.9974  1.0030 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988 
## [1] "One-way ANOVA test"
##               Df   Sum Sq   Mean Sq F value   Pr(>F)    
## quality        5 0.000230 4.594e-05    13.4 8.12e-13 ***
## Residuals   1593 0.005462 3.430e-06                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Summary of variable pH By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.162   3.230   3.267   3.350   3.720 
## [1] "One-way ANOVA test"
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## quality        5   0.51 0.10242   4.342 0.000628 ***
## Residuals   1593  37.58 0.02359                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Summary of variable sulphates By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000 
## [1] "One-way ANOVA test"
##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        5   3.00  0.6000   22.27 <2e-16 ***
## Residuals   1593  42.91  0.0269                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Summary of variable alcohol By quality"
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00 
## [1] "One-way ANOVA test"
##               Df Sum Sq Mean Sq F value Pr(>F)    
## quality        5  483.9   96.79   115.9 <2e-16 ***
## Residuals   1593 1330.8    0.84                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] "Correlation matrix for independent variables"
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000            0.256       0.672
## volatile.acidity             0.256            1.000       0.552
## citric.acid                  0.672            0.552       1.000
## residual.sugar               0.115            0.002       0.144
## chlorides                    0.094            0.061       0.204
## free.sulfur.dioxide          0.154            0.011       0.061
## total.sulfur.dioxide         0.113            0.076       0.036
## density                      0.668            0.022       0.365
## pH                           0.683            0.235       0.542
## sulphates                    0.183            0.261       0.313
## alcohol                      0.062            0.202       0.110
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.115     0.094               0.154
## volatile.acidity              0.002     0.061               0.011
## citric.acid                   0.144     0.204               0.061
## residual.sugar                1.000     0.056               0.187
## chlorides                     0.056     1.000               0.006
## free.sulfur.dioxide           0.187     0.006               1.000
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201               0.022
## pH                            0.086     0.265               0.070
## sulphates                     0.006     0.371               0.052
## alcohol                       0.042     0.221               0.069
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       0.113   0.668 0.683     0.183   0.062
## volatile.acidity                    0.076   0.022 0.235     0.261   0.202
## citric.acid                         0.036   0.365 0.542     0.313   0.110
## residual.sugar                      0.203   0.355 0.086     0.006   0.042
## chlorides                           0.047   0.201 0.265     0.371   0.221
## free.sulfur.dioxide                 0.668   0.022 0.070     0.052   0.069
## total.sulfur.dioxide                1.000   0.071 0.066     0.043   0.206
## density                             0.071   1.000 0.342     0.149   0.496
## pH                                  0.066   0.342 1.000     0.197   0.206
## sulphates                           0.043   0.149 0.197     1.000   0.094
## alcohol                             0.206   0.496 0.206     0.094   1.000

  1. The values of volatile.acidity, density and pH tend to decrease as the quality of the wine increases.
  2. The values of citric.acid, sulphates and alcohol tend to increase as the quality of the wine increases.
  3. The values of fixed.acidity, residual.sugar and chlorides do not seem to vary much with quality, although of the three only residual.sugar has a non-significant ANOVA result (p = 0.385).
  4. The values of free.sulfur.dioxide and total.sulfur.dioxide seem to be lower in low quality and high quality wines and higher in middle quality wines.
  5. The absolute correlation coefficients between free.sulfur.dioxide and total.sulfur.dioxide (0.668) and between fixed.acidity and each of citric.acid (0.672), density (0.668) and pH (0.683) are higher than 0.6. We might consider using only one variable from each correlated pair when building the model.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Variables volatile.acidity, density, pH, citric.acid, sulphates and alcohol tend to change as the quality of the wine increases: the first three decrease and the last three increase.
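
The one-way ANOVA tables in the previous section quantify these differences. As a worked check, the sketch below recomputes the F value and p-value for fixed.acidity from the mean squares and degrees of freedom printed in its table. Tables of this shape are what `summary(aov(...))` prints in R, but the exact call is not shown in this report, so that part is an assumption.

```r
# Values copied from the fixed.acidity ANOVA table above.
ms_between <- 18.737   # Mean Sq for quality
ms_within  <- 2.982    # Mean Sq for Residuals
df_between <- 5        # 6 quality levels - 1
df_within  <- 1593     # 1599 wines - 6 quality levels

# F is the ratio of between-group to within-group mean squares.
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(F = round(f_value, 3), p = signif(p_value, 3)))
```

A small p-value says only that the group means are not all equal; it does not say the trend across quality levels is monotonic.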

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Variables free.sulfur.dioxide and total.sulfur.dioxide, as well as fixed.acidity and citric.acid, are moderately correlated. We might need to exclude one variable from each correlated pair in model building.
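
Correlated pairs can be listed programmatically rather than read off the matrix by eye. The sketch below copies the absolute correlations for six of the variables from the matrix above and extracts every pair above a 0.6 cutoff. The cutoff and the object names are choices made for this example.

```r
vars <- c("fixed.acidity", "citric.acid", "density", "pH",
          "free.sulfur.dioxide", "total.sulfur.dioxide")

# Absolute correlations copied from the correlation matrix above.
abs_cor <- matrix(
  c(1.000, 0.672, 0.668, 0.683, 0.154, 0.113,
    0.672, 1.000, 0.365, 0.542, 0.061, 0.036,
    0.668, 0.365, 1.000, 0.342, 0.022, 0.071,
    0.683, 0.542, 0.342, 1.000, 0.070, 0.066,
    0.154, 0.061, 0.022, 0.070, 1.000, 0.668,
    0.113, 0.036, 0.071, 0.066, 0.668, 1.000),
  nrow = 6, byrow = TRUE, dimnames = list(vars, vars)
)

# Look only at the upper triangle so each pair is reported once.
idx <- which(upper.tri(abs_cor) & abs_cor > 0.6, arr.ind = TRUE)
high_pairs <- data.frame(
  var1    = vars[idx[, "row"]],
  var2    = vars[idx[, "col"]],
  abs_cor = abs_cor[idx]
)
high_pairs <- high_pairs[order(-high_pairs$abs_cor), ]
print(high_pairs, row.names = FALSE)
```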

What was the strongest relationship you found?

The absolute correlation coefficient between fixed.acidity and pH is 0.683, which is the highest among all pairs of independent variables. The pair free.sulfur.dioxide and total.sulfur.dioxide is close behind at 0.668.

Multivariate Plots Section

In these plots, we search for combinations of independent variables that appear to support a clear separating line, one that distinguishes wines with quality less than or equal to five from those with quality higher than five.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Judging from the plots, I would say that the combination of sulphates and alcohol, the combination of chlorides and alcohol, the combination of volatile.acidity and alcohol, and the combination of volatile.acidity and sulphates seem to be able to help us distinguish wines with higher quality (\(\geq 6\)) from wines with lower quality (\(\leq 5\)).
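
The plots themselves are not reproduced in this text. As a rough sketch of the kind of plot described, the code below draws sulphates against alcohol and colours the points by quality group, using base R graphics and only the ten rows from the str() preview. The plotting library, the colours and the name `wine_head` are assumptions; the original figures may have been drawn differently.

```r
# Ten rows copied from the str() preview; a toy subset for illustration only.
wine_head <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality   = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Quality group used for colouring (same definition as leqFive).
wine_head$leqFive <- wine_head$quality <= 5
point_col <- ifelse(wine_head$leqFive, "firebrick", "steelblue")

plot(wine_head$alcohol, wine_head$sulphates,
     col = point_col, pch = 19,
     xlab = "alcohol", ylab = "sulphates",
     main = "sulphates vs alcohol by quality group")
legend("topleft", legend = c("quality <= 5", "quality >= 6"),
       col = c("firebrick", "steelblue"), pch = 19)
```

On the full data set, a pair of variables is useful when the two colours occupy visibly different regions of the plot.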

Were there any interesting or surprising interactions between features?

Even though free.sulfur.dioxide and total.sulfur.dioxide are moderately correlated with each other, the plots show that many low quality (\(\leq 5\)) wines tend to have a higher value of total.sulfur.dioxide for a given value of free.sulfur.dioxide. So the combination of the two variables seems to provide some explanation for wine quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

library(randomForest)  # provides rfcv(), tuneRF(), randomForest() and importance()

set.seed(0306)
# find number of variables to use
wine.rfcv <- rfcv(trainx = wine[,-c(12,13)],
                  trainy = wine$quality,
                  cv.fold=5)
plot(wine.rfcv$n.var, wine.rfcv$error.cv, pch = 19, type = "b")

# find parameter value `mtry`
wine.tunedRF <- tuneRF(x=wine[,-c(12,13)],
                       y=wine$quality)
## mtry = 3  OOB error = 29.52% 
## Searching left ...
## mtry = 2     OOB error = 29.83% 
## -0.01059322 0.05 
## Searching right ...
## mtry = 6     OOB error = 30.46% 
## -0.03177966 0.05

# fit randomForest model
set.seed(1126)
wine.rf <- randomForest(x=wine[,-c(12,13)],
                        y=wine$quality,
                        ntree = 1500,
                        mtry = 3,
                        importance = TRUE)

# see the importance of variables
importance(wine.rf)
##                                3          4         5        6        7
## fixed.acidity        -0.56111865 -0.3486899  48.42989 42.03243 46.28220
## volatile.acidity      8.06171651 14.6071562  66.20499 45.71393 74.95099
## citric.acid           3.89390048  3.3848477  41.64058 37.99165 46.74801
## residual.sugar        2.06429993 -2.9363878  47.47231 48.15396 42.69723
## chlorides            -1.68556675 -3.9360030  57.00153 44.74722 40.27894
## free.sulfur.dioxide   0.06058017  2.2928119  46.74999 44.94971 37.86081
## total.sulfur.dioxide  3.32982546  1.9831489  69.41414 62.44532 56.57785
## density              -4.09555243 -6.7015283  53.32682 57.13734 52.91719
## pH                   -0.93451508  6.6218486  48.66793 39.11223 39.63130
## sulphates             1.66961907 12.3961049  75.45682 67.87864 81.22823
## alcohol              -0.15992995  5.4472305 105.11547 68.96970 92.39223
##                              8 MeanDecreaseAccuracy MeanDecreaseGini
## fixed.acidity         6.569794             70.87984         77.95775
## volatile.acidity     12.661250             90.69189        107.14393
## citric.acid          11.550874             66.58893         75.73842
## residual.sugar       10.426693             75.53234         73.23089
## chlorides             9.189612             72.69759         83.28575
## free.sulfur.dioxide   9.015734             70.41771         68.46716
## total.sulfur.dioxide 12.624055             96.21514        106.75614
## density               9.432491             83.12110         94.75334
## pH                    9.925760             70.75903         76.81289
## sulphates            18.908448            109.32273        113.46508
## alcohol              19.247203            130.01911        149.44436
# confusion matrix of out-of-bag (OOB) predictions (rows: predicted, columns: actual)
table(wine.rf$predicted, wine$quality)
##    
##       3   4   5   6   7   8
##   3   0   1   0   0   0   0
##   4   1   0   1   1   0   0
##   5   8  36 562 121  10   0
##   6   1  15 112 484  76   9
##   7   0   1   6  32 112   7
##   8   0   0   0   0   1   2
# prediction accuracy (%)
100*round(sum(diag(table(wine.rf$predicted, wine$quality)))/nrow(wine),4)
## [1] 72.55

I built a random forest model using all the independent variables in the original dataset. Because wine.rf$predicted holds out-of-bag predictions, the 72.55% accuracy is an out-of-bag estimate rather than a true in-sample figure, and it is not very high.
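
The confusion matrix printed above can be examined a little further without refitting the model. The sketch below re-enters the matrix by hand and computes the overall accuracy, the per-class recall, and the share of errors that are confusions between quality 5 and quality 6. The object names are chosen for this example.

```r
lv <- as.character(3:8)

# Confusion matrix copied from the output above (rows: predicted, columns: actual).
conf_mat <- matrix(
  c(0,  1,   0,   0,   0, 0,
    1,  0,   1,   1,   0, 0,
    8, 36, 562, 121,  10, 0,
    1, 15, 112, 484,  76, 9,
    0,  1,   6,  32, 112, 7,
    0,  0,   0,   0,   1, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(predicted = lv, actual = lv)
)

# Overall accuracy in percent.
accuracy <- 100 * sum(diag(conf_mat)) / sum(conf_mat)

# Recall per class: correctly predicted wines over all wines of that quality.
recall <- diag(conf_mat) / colSums(conf_mat)

# Misclassifications that swap quality 5 and quality 6.
errors   <- sum(conf_mat) - sum(diag(conf_mat))
swap_5_6 <- conf_mat["5", "6"] + conf_mat["6", "5"]

print(round(accuracy, 2))
print(round(recall, 3))
print(round(100 * swap_5_6 / errors, 1))
```

The per-class recall shows how unevenly the accuracy is distributed: the rare qualities 3, 4 and 8 are almost never predicted correctly.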


Final Plots and Summary

Plot One

Description One

The boundary between quality \(\leq 5\) and quality \(\geq 6\) divides the data set into two roughly equal halves. 82% of the wines are of quality 5 or 6.

Plot Two

Description Two

The median values of sulphates, alcohol and citric.acid tend to increase as the quality of the wine gets higher.
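
This trend can be checked against the per-quality medians printed in the Bivariate Plots Section. The sketch below copies those medians and computes the Spearman rank correlation between quality level and median as a simple measure of monotonic trend. Using Spearman correlation here is a choice made for this example, not a step from the original analysis.

```r
# Medians by quality level, copied from the "By quality" summaries above.
medians <- data.frame(
  quality     = 3:8,
  sulphates   = c(0.545, 0.56, 0.58, 0.64, 0.74, 0.74),
  alcohol     = c(9.925, 10.00, 9.70, 10.50, 11.50, 12.15),
  citric.acid = c(0.035, 0.09, 0.23, 0.26, 0.40, 0.42)
)

# Spearman correlation of each median series with quality:
# a value of 1 means the medians never decrease as quality rises.
trend <- sapply(medians[-1],
                function(m) cor(medians$quality, m, method = "spearman"))
print(round(trend, 3))
```

The alcohol series is not perfectly monotonic because the median dips at quality 5, which is why the statement above says "tend to increase".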

Plot Three

Description Three

Even though each single independent variable has only a weak correlation with wine quality, combinations of two variables can support a separating line that distinguishes wines with quality lower than or equal to 5 from wines with quality higher than 5.

Reflection

The purpose of this data exploration is to identify the variables to use in building a model to predict wine quality. We find that no single variable indicates wine quality well enough. Using combinations of variables, we can get a better idea of the wine quality. I used a random forest model to perform feature selection; the results suggest that we need to use all of the variables at hand. Based on the prediction results on the sample, most classification errors occur between quality 5 and quality 6. We might need to investigate further in that direction.